-
Notifications
You must be signed in to change notification settings - Fork 45
feat: Add ContainerBackend with Docker and Podman #119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @Fiona-Waters!
As we discussed here: #111 (comment), can we consolidate Podman and Docker under single container backend ?
Given that those backend should have similar APIs, I think it would be better to consolidate them, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/#runner-dockerrunner
Thanks @andreyvelich I will look at updating the implementation. |
|
@andreyvelich @astefanutti regarding comments on this PR and #111 this is what I propose: We have 3 backends:
For the Local Container backend we automatically try Docker first, then Podman and then fallback to Subprocess if neither runtime is available. We use the adapter pattern with a container client adapter unified interface, and docker and podman specific calls are implemented in separate adapter classes. |
|
Sure, that looks great @Fiona-Waters!
Why do we need to fallback to subprocess ? In the ContainerBackend users can select: |
|
@Fiona-Waters that sounds to me. I agree the fallback logic may really apply to choose the default container runtime. Other than that, I'd be inclined to drop the "Local" prefix entirely. Even Kubernetes could run local with KinD, and I doubt the SDK will ever do remote process. |
Understood. Let me see what I can do. Thank you for the swift reply! |
Ok cool. Let me see what I can do. Thank you! |
bdde877 to
d1288b2
Compare
|
@andreyvelich @astefanutti @briangallagher |
|
/ok-to-test |
defd7eb to
24f0bbd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fiona-Waters thanks for this awesome work!
That looks good to me overall.
/assign @kubeflow/kubeflow-sdk-team @briangallagher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Fiona-Waters!
I left my initial messages.
| @@ -0,0 +1,25 @@ | |||
| apiVersion: trainer.kubeflow.org/v1alpha1 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of installing the runtimes, can we just read the image version from GitHub dynamically ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me look into that. For offline support should we fall back to providing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fiona-Waters Do we require only image ?
We can fallback to the constant, that we define somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to hardcode constants that will then need to be updated manually later? Something like this:
DEFAULT_FRAMEWORK_IMAGES = {
"torch": "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime",
}
This would then be used to create a default runtime, if the user doesn't provide a runtime url (and we have removed the sample one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed changes to use a default image if a user doesn't provide a URL and have removed the yaml. Please review. Thanks
3e1f243 to
e53ef9f
Compare
Signed-off-by: Brian Gallagher <[email protected]>
Signed-off-by: Fiona Waters <[email protected]>
Signed-off-by: Fiona Waters <[email protected]>
Signed-off-by: Fiona Waters <[email protected]>
… memory Signed-off-by: Fiona Waters <[email protected]>
e53ef9f to
7931f57
Compare
|
@andreyvelich I have addressed all comments, and rebased. Please take a look when you can. Thanks. |
6edf0e3 to
9bf4cb7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @Fiona-Waters!
Just a few small nits, overall looks great!
We should be ready to move this forward.
5df1c72 to
51cf1ed
Compare
Signed-off-by: Fiona Waters <[email protected]>
Signed-off-by: Fiona Waters <[email protected]>
51cf1ed to
525d4c9
Compare
Pull Request Test Coverage Report for Build 18998808027Details
💛 - Coveralls |
|
@andreyvelich I have addressed your most recent comments. We should be good to go now. Let me know if you want me to squash the commits. Thank you! |
|
Thanks for this great contribution @Fiona-Waters! |
|
Thanks @Fiona-Waters for this awesome work! /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: astefanutti The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Thanks @andreyvelich @astefanutti, delighted to have made this contribution. Thank you, and to all reviewers for their help. |
What this PR does / why we need it:
This PR introduces a unified ContainerBackend that automatically detects and uses either Docker or Podman for local training execution. This replaces the previous separate LocalDockerBackend and LocalPodmanBackend implementations with a single, cleaner abstraction. You can see the Docker and Podman implementations in separate commits.
This implementation tries Docker first, then falls back to Podman if Docker is unavailable. This can be overridden via ContainerBackendConfig.runtime to force a specific runtime ("docker" or "podman"). An error is raised if neither runtime is available.
Unit tests for the backend implementation have also been added. Examples for using Docker and Podman will be added to the Trainer repo later.
Manually testing on Mac I had to specify the container_host like so:
Docker via Colima
container_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"Podman Desktop
container_host=f"unix://{os.path.expanduser('~')}/.local/share/containers/podman/machine/podman.sock"Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Fixes ##114 and #108
Checklist:
I need to look at adding docs. A README has been included.